1 Large Language Models (LLMs)

Table 1.1: LLMs
  Provider   Model              Version                     Estimate   Rank
1 anthropic  Claude 3.7 Sonnet  claude-3-7-sonnet-20250219  3.8580848  top
2 anthropic  Claude 3.5 Sonnet  claude-3-5-sonnet-20241022  3.4210271  top
3 xai        Grok 3 Beta        grok-3-beta                 3.0488472  top
4 anthropic  Claude 3 Haiku     claude-3-haiku-20240307     0.3764656  bottom
5 cohere     Command R          command-r-08-2024           0.3764656  bottom
6 openai     GPT-3.5 Turbo      gpt-3.5-turbo               0.3299676  bottom
7 openai     GPT-4o Mini        gpt-4o-mini                 0.2865677  bottom
8 google     Gemini 2.5 Flash   gemini-2.5-flash            NA         new

Building on our previous analysis, we selected models based on their performance. We chose 4 top models, which were reliably more consistent than chance in their deliberative reasoning, and 4 bottom models, which were reliably less consistent than chance. The fourth top model, Gemini 2.5 Flash, is new to this analysis and therefore has no previous estimate in Table 1.1 (see footnote 1).

2 Cases

Table 2.1: Cases
  Case                   Survey          N Participants
1 CCPS ACT Deliberative  ccps            31
2 CSIRO WA               energy_futures  17
3 Winterthur             zh_winterthur   16

3 Surveys

Table 3.1: Surveys
  survey          considerations  policies  scale_max  q_method
1 ccps            33              7         11         FALSE
2 energy_futures  45              9         11         FALSE
3 zh_winterthur   30              6         7          FALSE

4 Roles

Table 4.1: Roles
    uid  type         article  role                    description
1   eco  ideology     an       ecologist               focuses on environmental protection and sustainability, advocating for societal change to ecological limits
2   coa  perspective  a        coastal resident        endures chronic flooding and salinization, forced to relocate due to rising sea levels and intense storms worsened by climate change
3   ctr  perspective  a        construction worker     suffers from extreme heat stress and lost work hours, perceiving climate change making outdoor labor unbearable and life-threatening
4   dis  perspective  a        disease survivor        recovers from dengue fever, aware that climate change’s rising temperatures are expanding the range of disease-carrying mosquitoes in their region
5   eld  perspective  an       elderly urban resident  endures intensified city heatwaves, struggling with disrupted services and feeling the direct, severe impact of climate change
6   far  perspective  a        displaced family        loses their home due to unprecedented wildfires, experiencing displacement and recognizing climate change as the major driver of the devastation
7   fis  perspective  a        fisher                  notes his declining catches due to warming oceans, understanding that climate change is reorganizing marine life and reducing their traditional yield
8   lan  perspective  a        landowner               surveys his parched fields after a prolonged drought, feeling the compounding impacts of climate change that reduce crop yields and family income
9   par  perspective  a        parent                  sees their child fall ill from a water-borne disease, attributing its spread to the increased heavy rainfall and warmer temperatures brought by climate change
10  sub  perspective  a        subsistence farmer      watches his crops wither under erratic rainfall patterns, and who sees these changes as direct consequence of climate change
11  vil  perspective  a        villager                faces dwindling, contaminated water supplies due to extended draughts and floods, aware that climate change is altering their water security
12  csk  devils       a        climate skeptic         prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science

5 Methods

5.1 Data collection

We collected 1440 responses generated by 8 models across the 3 surveys and 12 roles described above. We prompted each LLM 5 times with the same prompt, giving 8 × 3 × 12 × 5 = 1440 responses.
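The response count follows directly from the design. As a minimal sketch (written in Python rather than the R used for the report; the identifiers are copied from Tables 1.1, 3.1 and 4.1, while the variable names are illustrative assumptions), enumerating every model/survey/role/repetition combination yields the full request grid:

```python
from itertools import product

# Identifiers copied from Tables 1.1, 3.1 and 4.1; variable names are illustrative.
models = [
    "claude-3-7-sonnet-20250219", "claude-3-5-sonnet-20241022", "grok-3-beta",
    "claude-3-haiku-20240307", "command-r-08-2024", "gpt-3.5-turbo",
    "gpt-4o-mini", "gemini-2.5-flash",
]
surveys = ["ccps", "energy_futures", "zh_winterthur"]
roles = ["eco", "coa", "ctr", "dis", "eld", "far",
         "fis", "lan", "par", "sub", "vil", "csk"]
repetitions = range(1, 6)  # each prompt is sent 5 times

# One tuple per request: (model, survey, role, repetition).
requests = list(product(models, surveys, roles, repetitions))
print(len(requests))  # 8 * 3 * 12 * 5
```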

5.1.1 System prompt (Roles)

We instructed LLMs to play each of the roles described above by including a system instruction in each request following the pattern:

Answer the following prompts as [article] [role], who [description].

For example:

Answer the following prompts as a climate skeptic, who prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science.
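The pattern can be filled in mechanically from the columns of Table 4.1. The following Python sketch is illustrative only: the function name `build_system_prompt` is a hypothetical helper, not part of the original pipeline.

```python
def build_system_prompt(article: str, role: str, description: str) -> str:
    """Fill the system-prompt pattern with one row of Table 4.1."""
    return f"Answer the following prompts as {article} {role}, who {description}."


# Row 12 of Table 4.1 (uid "csk").
prompt = build_system_prompt(
    article="a",
    role="climate skeptic",
    description=(
        "prioritizes economic growth over CO2 emission cuts, fossil fuels "
        "over renewable energy, and does not believe in climate science"
    ),
)
print(prompt)
```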

5.2 Analysis

We calculated one Deliberative Reason Index (DRI) value per model/survey/role by treating each LLM response as one participant in a deliberation. The role “all” indicates that all roles were part of that deliberation (n = 60 participants, i.e., 5 participants for each of the 12 roles). DRI plots are shown in Figure 7.3.
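To make the grouping concrete, the sketch below (Python, with placeholder indices standing in for the real responses; the record layout is an assumption) groups the responses into simulated deliberations and counts the participants in each. It does not compute the DRI itself.

```python
from collections import Counter
from itertools import product

N_MODELS, N_SURVEYS, N_REPS = 8, 3, 5
ROLES = ["eco", "coa", "ctr", "dis", "eld", "far",
         "fis", "lan", "par", "sub", "vil", "csk"]

# One placeholder record per LLM response (assumed layout, no real data).
responses = [
    {"model": m, "survey": s, "role": r, "rep": k}
    for m, s, r, k in product(range(N_MODELS), range(N_SURVEYS), ROLES, range(N_REPS))
]

# Role-level deliberations: one group per model/survey/role.
per_role = Counter((x["model"], x["survey"], x["role"]) for x in responses)
# "all" deliberations: every role pooled within a model/survey.
all_roles = Counter((x["model"], x["survey"], "all") for x in responses)

print(len(per_role), set(per_role.values()))    # 288 groups of 5 participants
print(len(all_roles), set(all_roles.values()))  # 24 groups of 60 participants
```

The 288 role-level groups match the number of observations reported for the mixed models in Section 6.3.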

6 Hypothesis Testing

6.1 H1a: random data

6.2 H1b: one-sample Wilcoxon signed rank test

model              survey          obs_mean    N   mu  p_value_two.sided  sig_two.sided  p_value_greater  sig_greater
Claude 3.5 Sonnet  ccps             0.3759073  12  0   0.0009766          *              0.0004883        *
Claude 3.5 Sonnet  energy_futures   0.4695921  12  0   0.0009766          *              0.0004883        *
Claude 3.5 Sonnet  zh_winterthur    0.5683774  12  0   0.0004883          *              0.0002441        *
Claude 3.7 Sonnet  ccps             0.6819898  12  0   0.0004883          *              0.0002441        *
Claude 3.7 Sonnet  energy_futures   0.6173198  12  0   0.0004883          *              0.0002441        *
Claude 3.7 Sonnet  zh_winterthur    0.5911667  12  0   0.0004883          *              0.0002441        *
Grok 3 Beta        ccps             0.3605863  12  0   0.0004883          *              0.0002441        *
Grok 3 Beta        energy_futures   0.7103851  12  0   0.0004883          *              0.0002441        *
Grok 3 Beta        zh_winterthur    0.7314191  12  0   0.0004883          *              0.0002441        *
Gemini 2.5 Flash   ccps             0.8336696  12  0   0.0004883          *              0.0002441        *
Gemini 2.5 Flash   energy_futures   0.5166190  12  0   0.0009766          *              0.0004883        *
Gemini 2.5 Flash   zh_winterthur    0.6778375  12  0   0.0004883          *              0.0002441        *
GPT-4o Mini        ccps             0.0427425  12  0   0.6772461          n.s.           0.3386230        n.s.
GPT-4o Mini        energy_futures  -0.0899976  12  0   0.5693359          n.s.           0.7407227        n.s.
GPT-4o Mini        zh_winterthur   -0.2190937  12  0   0.0771484          n.s.           0.9680176        n.s.
GPT-3.5 Turbo      ccps            -0.2532340  12  0   0.0161133          *              0.9938965        n.s.
GPT-3.5 Turbo      energy_futures  -0.2836284  12  0   0.0122070          *              0.9953613        n.s.
GPT-3.5 Turbo      zh_winterthur   -0.4205772  12  0   0.0034180          *              0.9987793        n.s.
Command R          ccps            -0.4709172  12  0   0.0004883          *              1.0000000        n.s.
Command R          energy_futures  -0.0245292  12  0   0.7910156          n.s.           0.6333008        n.s.
Command R          zh_winterthur   -0.9582444  12  0   0.0004883          *              1.0000000        n.s.
Claude 3 Haiku     ccps            -0.3105968  12  0   0.0004883          *              1.0000000        n.s.
Claude 3 Haiku     energy_futures  -0.3584220  12  0   0.0009766          *              0.9997559        n.s.
Claude 3 Haiku     zh_winterthur   -0.6380549  12  0   0.0004883          *              1.0000000        n.s.
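Many p-values in the table repeat (0.0004883, 0.0002441, 0.0009766). This is what an exact signed-rank test produces with N = 12: there are only 2^12 = 4096 equally likely sign patterns under the null hypothesis, so the smallest attainable p-values are multiples of 1/4096. The Python sketch below assumes an exact test with no ties and no zero differences; the function names are our own and the code is not taken from the original analysis.

```python
def signed_rank_counts(n: int) -> list[int]:
    """Number of sign patterns giving each value of W+ (sum of positive ranks)."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Subset-sum recurrence: each rank is either positive or negative.
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return counts


def p_greater(w_plus: int, n: int) -> float:
    """Exact one-sided p-value P(W+ >= w_plus) under the null."""
    return sum(signed_rank_counts(n)[w_plus:]) / 2 ** n


def p_two_sided(w_plus: int, n: int) -> float:
    """Exact two-sided p-value: twice the smaller tail, capped at 1."""
    counts = signed_rank_counts(n)
    upper = sum(counts[w_plus:]) / 2 ** n
    lower = sum(counts[: w_plus + 1]) / 2 ** n
    return min(1.0, 2 * min(upper, lower))


n = 12
w_max = n * (n + 1) // 2  # 78: all 12 differences from mu = 0 are positive
print(round(p_two_sided(w_max, n), 7), round(p_greater(w_max, n), 7))
print(round(p_two_sided(w_max - 1, n), 7), round(p_greater(w_max - 1, n), 7))
```

Under these assumptions, a two-sided p-value of 0.0004883 is consistent with all 12 role-level DRI values lying on the same side of zero, and 0.0009766 with all but the smallest-ranked value doing so.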

6.3 H2

## Linear mixed model fit by REML ['lmerMod']
## Formula: dri ~ model + (1 | role) + (1 | survey)
##    Data: df
## 
## REML criterion at convergence: 127.1
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.86198 -0.63430  0.03286  0.59691  3.03838 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  role     (Intercept) 0.002483 0.04983 
##  survey   (Intercept) 0.005538 0.07442 
##  Residual             0.080233 0.28326 
## Number of obs: 288, groups:  role, 12; survey, 3
## 
## Fixed effects:
##                        Estimate Std. Error t value
## (Intercept)            -0.43569    0.06544  -6.658
## modelClaude 3.5 Sonnet  0.90698    0.06676  13.585
## modelClaude 3.7 Sonnet  1.06585    0.06676  15.964
## modelCommand R         -0.04887    0.06676  -0.732
## modelGemini 2.5 Flash   1.11173    0.06676  16.652
## modelGPT-3.5 Turbo      0.11654    0.06676   1.746
## modelGPT-4o Mini        0.34691    0.06676   5.196
## modelGrok 3 Beta        1.03649    0.06676  15.525
## 
## Correlation of Fixed Effects:
##             (Intr) mC3.5S mC3.7S mdlCmR mG2.5F mGPT-T mGPT-M
## mdlCld3.5Sn -0.510                                          
## mdlCld3.7Sn -0.510  0.500                                   
## modelCmmndR -0.510  0.500  0.500                            
## mdlGmn2.5Fl -0.510  0.500  0.500  0.500                     
## mdlGPT-3.5T -0.510  0.500  0.500  0.500  0.500              
## modlGPT-4Mn -0.510  0.500  0.500  0.500  0.500  0.500       
## modelGrk3Bt -0.510  0.500  0.500  0.500  0.500  0.500  0.500
## Linear mixed model fit by REML ['lmerMod']
## Formula: dri ~ model + (1 | survey/role)
##    Data: df
## 
## REML criterion at convergence: 128.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.95607 -0.66969 -0.00041  0.65619  3.06045 
## 
## Random effects:
##  Groups      Name        Variance  Std.Dev.
##  role:survey (Intercept) 0.0009013 0.03002 
##  survey      (Intercept) 0.0054477 0.07381 
##  Residual                0.0817355 0.28589 
## Number of obs: 288, groups:  role:survey, 36; survey, 3
## 
## Fixed effects:
##                        Estimate Std. Error t value
## (Intercept)            -0.43569    0.06412  -6.795
## modelClaude 3.5 Sonnet  0.90698    0.06739  13.460
## modelClaude 3.7 Sonnet  1.06585    0.06739  15.817
## modelCommand R         -0.04887    0.06739  -0.725
## modelGemini 2.5 Flash   1.11173    0.06739  16.498
## modelGPT-3.5 Turbo      0.11654    0.06739   1.730
## modelGPT-4o Mini        0.34691    0.06739   5.148
## modelGrok 3 Beta        1.03649    0.06739  15.381
## 
## Correlation of Fixed Effects:
##             (Intr) mC3.5S mC3.7S mdlCmR mG2.5F mGPT-T mGPT-M
## mdlCld3.5Sn -0.525                                          
## mdlCld3.7Sn -0.525  0.500                                   
## modelCmmndR -0.525  0.500  0.500                            
## mdlGmn2.5Fl -0.525  0.500  0.500  0.500                     
## mdlGPT-3.5T -0.525  0.500  0.500  0.500  0.500              
## modlGPT-4Mn -0.525  0.500  0.500  0.500  0.500  0.500       
## modelGrk3Bt -0.525  0.500  0.500  0.500  0.500  0.500  0.500
## boundary (singular) fit: see help('isSingular')
## refitting model(s) with ML (instead of REML)
## Data: df
## Models:
## m0: dri ~ 1 + (1 | survey/role)
## m1: dri ~ model + (1 | survey/role)
##    npar    AIC    BIC   logLik -2*log(L)  Chisq Df Pr(>Chisq)    
## m0    4 490.84 505.49 -241.420    482.84                         
## m1   11 118.59 158.89  -48.297     96.59 386.25  7  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##  model              emmean     SE   df lower.CL upper.CL
##  Gemini 2.5 Flash   0.6760 0.0641 7.44    0.526   0.8259
##  Claude 3.7 Sonnet  0.6302 0.0641 7.44    0.480   0.7800
##  Grok 3 Beta        0.6008 0.0641 7.44    0.451   0.7506
##  Claude 3.5 Sonnet  0.4713 0.0641 7.44    0.321   0.6211
##  GPT-4o Mini       -0.0888 0.0641 7.44   -0.239   0.0611
##  GPT-3.5 Turbo     -0.3191 0.0641 7.44   -0.469  -0.1693
##  Claude 3 Haiku    -0.4357 0.0641 7.44   -0.586  -0.2859
##  Command R         -0.4846 0.0641 7.44   -0.634  -0.3347
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95
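The output above can be cross-checked by hand. The Python sketch below uses only numbers copied from the R output and recomputes (a) the likelihood-ratio statistic and AIC values for m0 and m1 and (b) the estimated marginal means, which in this balanced design equal the intercept plus the model coefficient. Reading Claude 3 Haiku as the reference level is an inference from its absence among the fixed-effect terms.

```python
# (a) Likelihood-ratio test of m0 vs m1, from the -2*log(L) and npar columns.
neg2_loglik = {"m0": 482.84, "m1": 96.59}
npar = {"m0": 4, "m1": 11}

chisq = neg2_loglik["m0"] - neg2_loglik["m1"]  # difference in deviance
df = npar["m1"] - npar["m0"]                   # extra parameters in m1
aic = {m: neg2_loglik[m] + 2 * npar[m] for m in npar}
print(round(chisq, 2), df, {m: round(v, 2) for m, v in aic.items()})

# (b) Marginal means from the fixed effects (reference level: Claude 3 Haiku).
intercept = -0.43569
coefs = {
    "Claude 3.5 Sonnet": 0.90698,
    "Claude 3.7 Sonnet": 1.06585,
    "Command R": -0.04887,
    "Gemini 2.5 Flash": 1.11173,
    "GPT-3.5 Turbo": 0.11654,
    "GPT-4o Mini": 0.34691,
    "Grok 3 Beta": 1.03649,
}
marginal_means = {"Claude 3 Haiku": intercept}
marginal_means.update({name: intercept + beta for name, beta in coefs.items()})

# Sort from highest to lowest mean, as in the emmeans table.
ranking = sorted(marginal_means.items(), key=lambda kv: -kv[1])
for name, mean in ranking:
    print(f"{name:18s} {mean: .4f}")
```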

## # A tibble: 12 × 3
##    role  mean_dri sd_dri
##    <chr>    <dbl>  <dbl>
##  1 coa     0.125   0.547
##  2 csk     0.287   0.550
##  3 ctr     0.189   0.457
##  4 dis     0.0416  0.564
##  5 eco     0.141   0.638
##  6 eld     0.149   0.531
##  7 far     0.0617  0.612
##  8 fis     0.0519  0.604
##  9 lan     0.170   0.506
## 10 par     0.111   0.608
## 11 sub     0.210   0.541
## 12 vil     0.0379  0.616
## # A tibble: 12 × 4
##    role  mean_role_noise max_role_noise min_role_noise
##    <chr>           <dbl>          <dbl>          <dbl>
##  1 coa             0.246          0.549        0.116  
##  2 csk             0.187          0.370        0.00776
##  3 ctr             0.299          0.402        0.106  
##  4 dis             0.217          0.369        0.00799
##  5 eco             0.233          0.517        0.0277 
##  6 eld             0.245          0.724        0.0452 
##  7 far             0.221          0.373        0.0647 
##  8 fis             0.192          0.566        0.0365 
##  9 lan             0.251          0.442        0.121  
## 10 par             0.304          0.559        0.0512 
## 11 sub             0.349          0.685        0.128  
## 12 vil             0.301          0.571        0.0186
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  sd_rep by role
## Fligner-Killeen:med chi-squared = 8.0891, df = 11, p-value = 0.7053
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  7  1.8873 0.08108 .
##       88                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.4 Detailed exploration

7 Findings

7.1 Consistency

We compared top and bottom models in terms of the consistency of DRI and Cronbach’s Alpha (see top models in Figure 7.1 and bottom models in Figure 7.2).

7.1.1 Top models

Figure 7.1: Top models

We found that top LLMs are consistent across roles in terms of both DRI and Cronbach’s Alpha (policies). The high DRI across roles (median = 0.637; IQR = 0.161) suggests that these LLMs tend to consistently align their considerations and policy preferences. The high Cronbach’s alpha for their policy preferences (median = 0.784; IQR = 0.047) suggests that these LLMs tend to agree on the ranking of their policy preferences.
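For readers unfamiliar with the statistic, the following Python sketch computes the standard Cronbach’s alpha, α = k/(k−1) · (1 − Σ item variances / variance of totals). The report does not state how the data were arranged, so treating each simulated participant as an “item” and each policy as an observation is an assumption, and the ratings are made-up toy values.

```python
from statistics import variance


def cronbach_alpha(ratings: list[list[float]]) -> float:
    """Cronbach's alpha for k raters scoring the same set of policies.

    `ratings[i][j]` is the score rater i gives to policy j (assumed layout).
    """
    k = len(ratings)
    item_variances = [variance(r) for r in ratings]   # per-rater variance
    totals = [sum(col) for col in zip(*ratings)]       # per-policy total score
    return k / (k - 1) * (1 - sum(item_variances) / variance(totals))


# Toy data: three raters in perfect agreement, then two who partly disagree.
identical = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
partial = [[1, 2, 3, 4], [2, 1, 4, 3]]
print(cronbach_alpha(identical), cronbach_alpha(partial))
```

Values close to 1 indicate that the raters order the policies in nearly the same way, which is the reading used in the paragraph above.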

7.1.2 Bottom models

Figure 7.2: Bottom models

We also found that bottom LLMs are not consistent across roles in terms of DRI and are less consistent than top models in terms of Cronbach’s Alpha (policies). The low DRI across roles (median = -0.177; IQR = 0.163) suggests that these LLMs tend to consistently misalign their considerations and policy preferences. The lower Cronbach’s alpha for their policy preferences (median = 0.635; IQR = 0.11) suggests that these LLMs agree less on the ranking of their policy preferences than top models do.

7.1.3 Summary for each model

7.1.3.1 DRI

Table 7.1: Mean DRI across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.512 0.639 -0.291 -0.281 0.638 -0.213 0.000 0.625 claude-3-7-sonnet-20250219
2 coa 0.350 0.565 -0.526 -0.435 0.810 -0.315 -0.019 0.567 gemini-2.5-flash
3 csk 0.543 0.773 -0.118 -0.580 0.875 0.163 -0.153 0.795 gemini-2.5-flash
4 ctr 0.343 0.567 -0.368 -0.264 0.663 -0.129 0.252 0.447 gemini-2.5-flash
5 dis 0.476 0.538 -0.553 -0.490 0.569 -0.719 0.057 0.455 gemini-2.5-flash
6 eco 0.364 0.720 -0.281 -0.831 0.854 -0.472 0.084 0.696 gemini-2.5-flash
7 eld 0.404 0.498 -0.335 -0.396 0.796 -0.078 -0.322 0.626 gemini-2.5-flash
8 far 0.479 0.651 -0.524 -0.673 0.821 -0.388 -0.370 0.497 gemini-2.5-flash
9 fis 0.497 0.593 -0.492 -0.560 0.685 -0.665 -0.244 0.602 gemini-2.5-flash
10 lan 0.595 0.633 -0.318 -0.347 0.477 -0.466 0.199 0.587 claude-3-7-sonnet-20250219
11 par 0.498 0.708 -0.669 -0.472 0.598 -0.164 -0.284 0.670 claude-3-7-sonnet-20250219
12 sub 0.526 0.712 -0.433 -0.218 0.556 -0.106 -0.014 0.654 claude-3-7-sonnet-20250219
13 vil 0.581 0.604 -0.612 -0.550 0.407 -0.490 -0.252 0.613 grok-3-beta
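The best_model column appears to be the row-wise maximum of the eight model columns. The Python sketch below checks that reading on three rows copied from Table 7.1; the helper name `best_model` and the tie-breaking rule (first column wins) are assumptions.

```python
# Column order as in Table 7.1.
columns = [
    "claude-3-5-sonnet-20241022", "claude-3-7-sonnet-20250219",
    "claude-3-haiku-20240307", "command-r-08-2024", "gemini-2.5-flash",
    "gpt-3.5-turbo", "gpt-4o-mini", "grok-3-beta",
]

# Three rows copied from Table 7.1 (mean DRI per role).
rows = {
    "all": [0.512, 0.639, -0.291, -0.281, 0.638, -0.213, 0.000, 0.625],
    "coa": [0.350, 0.565, -0.526, -0.435, 0.810, -0.315, -0.019, 0.567],
    "vil": [0.581, 0.604, -0.612, -0.550, 0.407, -0.490, -0.252, 0.613],
}


def best_model(values: list[float]) -> str:
    """Return the column name with the highest value (first one on ties)."""
    best_index = max(range(len(values)), key=values.__getitem__)
    return columns[best_index]


for role, values in rows.items():
    print(role, best_model(values))
```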

7.1.3.2 Cronbach’s Alpha (Policies)

Table 7.2: Mean alpha (policies) across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.725 0.792 0.614 0.638 0.801 0.599 0.641 0.818 grok-3-beta
2 coa 0.713 0.745 0.816 0.808 0.771 0.737 0.763 0.807 claude-3-haiku-20240307
3 csk 0.783 0.802 0.813 0.708 0.848 0.764 0.715 0.851 grok-3-beta
4 ctr 0.749 0.791 0.774 0.776 0.918 0.787 0.727 0.755 gemini-2.5-flash
5 dis 0.761 0.772 0.669 0.802 0.771 0.762 0.756 0.796 command-r-08-2024
6 eco 0.764 0.844 0.711 0.730 0.814 0.800 0.759 0.716 claude-3-7-sonnet-20250219
7 eld 0.722 0.793 0.788 0.740 0.741 0.801 0.813 0.828 grok-3-beta
8 far 0.726 0.807 0.791 0.843 0.827 0.769 0.828 0.824 command-r-08-2024
9 fis 0.787 0.792 0.690 0.793 0.829 0.750 0.825 0.704 gemini-2.5-flash
10 lan 0.715 0.792 0.802 0.805 0.789 0.783 0.795 0.792 command-r-08-2024
11 par 0.785 0.704 0.774 0.777 0.790 0.778 0.762 0.833 grok-3-beta
12 sub 0.841 0.800 0.671 0.754 0.761 0.760 0.803 0.839 claude-3-5-sonnet-20241022
13 vil 0.708 0.818 0.770 0.794 0.808 0.786 0.798 0.662 claude-3-7-sonnet-20250219

7.1.3.3 Cronbach’s Alpha (Considerations)

Table 7.3: Mean alpha (considerations) across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.990 0.990 0.976 0.975 0.984 0.911 0.976 0.987 claude-3-5-sonnet-20241022
2 coa 0.863 0.918 0.880 0.787 0.849 0.886 0.837 0.891 claude-3-7-sonnet-20250219
3 csk 0.769 0.856 0.898 0.767 0.551 0.952 0.817 0.831 gpt-3.5-turbo
4 ctr 0.916 0.909 0.872 0.915 0.852 0.916 0.852 0.906 claude-3-5-sonnet-20241022
5 dis 0.905 0.921 0.894 0.904 0.859 0.918 0.876 0.896 claude-3-7-sonnet-20250219
6 eco 0.900 0.860 0.884 0.827 0.842 0.865 0.871 0.863 claude-3-5-sonnet-20241022
7 eld 0.917 0.899 0.919 0.886 0.917 0.911 0.879 0.903 claude-3-haiku-20240307
8 far 0.905 0.848 0.919 0.747 0.815 0.774 0.860 0.905 claude-3-haiku-20240307
9 fis 0.916 0.895 0.894 0.907 0.896 0.918 0.891 0.905 gpt-3.5-turbo
10 lan 0.917 0.914 0.884 0.904 0.884 0.885 0.909 0.917 claude-3-5-sonnet-20241022
11 par 0.925 0.905 0.863 0.867 0.830 0.888 0.885 0.922 claude-3-5-sonnet-20241022
12 sub 0.902 0.919 0.895 0.758 0.851 0.889 0.906 0.911 claude-3-7-sonnet-20250219
13 vil 0.881 0.880 0.914 0.901 0.873 0.927 0.895 0.887 gpt-3.5-turbo

7.2 Model/Survey DRI Plots

These plots show a simulated deliberation across all 12 roles for each survey and model. Each simulated deliberation has 60 participants (12 roles with 5 participants each).

Note that bottom models are visually inconsistent.

Figure 7.3: DRI Plots

7.3 Survey/Role DRI Plots

These plots show a simulated deliberation across all models in the same class (i.e., top, bottom) for each role and survey. Each simulated deliberation has 20 participants (4 models with 5 participants each).

Note that top models are visually more consistent than bottom models.

7.3.1 Top models

7.3.2 Bottom models

7.4 Permutation tests

NOTE: This section is skipped by default. Remove eval = FALSE from the R chunk options to run the following chunks.

We conducted permutation tests with 10,000 iterations to check which models and which roles are consistent beyond what chance alone would produce.

7.4.1 Models and Surveys: Which models are truly consistent across roles?

In this permutation test, we explore the likelihood that the consistency, measured by DRI, is due to chance across surveys and roles.

Most models seem to be consistent across roles. Few of the 10,000 permutations led to a higher DRI than the observed DRI, suggesting that the observed value is likely not due to chance.
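The report does not show how the permutations were generated. As a hedged illustration of the general technique, the Python sketch below shuffles one of two paired vectors and counts how often the permuted statistic is at least as large as the observed one. The `agreement` function is a hypothetical stand-in for the DRI, and the data are toy values.

```python
import random


def agreement(x: list[float], y: list[float]) -> float:
    """Hypothetical stand-in statistic: higher means closer pairing of x and y."""
    return -sum(abs(a - b) for a, b in zip(x, y))


def permutation_p_value(x, y, statistic, n_perm=10_000, seed=0):
    """One-sided permutation p-value for `statistic` under shuffling of y."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistic(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if statistic(x, y_perm) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


x = list(range(10))
y = list(range(10))  # toy data: perfectly aligned with x
observed, p_value = permutation_p_value(x, y, agreement)
print(observed, p_value)
```

A small p-value here means that few shuffled pairings reach the observed statistic, which mirrors the interpretation given above for the observed DRI.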

7.4.2 Surveys and Roles: Are models truly consistent across roles?

In this permutation test, we explore the likelihood that the consistency, measured by DRI, is due to chance across surveys and roles.


  1. Note that gemini-2.5-pro-preview-03-25 was replaced by gemini-2.5-pro. However, this version of the model became significantly slower and more expensive because “thinking” is enabled by default and cannot be turned off. As a result, we decided to use the flash version (gemini-2.5-flash), a lighter and cheaper alternative.